ABSTRACT: The first iteration of AFRL’s Agent-based Modeling and Behavior Representation (AMBR) Model
Comparison  Project  was  quite  a  learning  experience  for  all  involved.  This  paper  focuses  on  feedback  received, 
challenges  faced,  and  lessons  learned  during  and  after  Round  1  of  the  AMBR  Model  Comparison.  We  include  a  section 
on  implications  for  future  (or  other)  human  model  comparisons,  in  the  hope  that  others  who  may  adopt  this  general 
methodology  will  find  useful  suggestions  for  planning  their  own  human  model  comparisons.  The  paper  ends  with  a 
description  of  current  plans  for  AMBR  Rounds  3  and  4. 


1.  Introduction 

This  paper  is  the  last  of  several  presented  as  part  of  the 
AMBR Model Comparison Symposium at the 10th Annual
Conference  on  Computer-Generated  Forces  and  Behavior 
Representation.  We  mentioned  in  the  introductory  paper 
for  this  symposium  [1]  that  the  first  goal  of  the  AMBR 
Model  Comparison  Project  is  to  advance  the  state  of  the 
art  in  cognitive  and  behavioral  modeling.  The  other 
Symposium  papers  provide  ample  evidence  that  the 
participating  modeling  architectures  were  challenged  and 
improved  as  a  direct  result  of  their  participation  in  this 
project,  which  we  consider  to  be  an  indication  of  success 
in  advancing  the  state  of  the  art. 

This  positive  outcome  notwithstanding,  we  have  found 
that  the  comparison  of  human  behavior  representation 
(HBR)  models  is  a  challenging  undertaking.  One  reason 
for  this  is  that  it  is  an  unusual  occurrence.  It  is  rare  to 
have  the  opportunity  to  compare  and  contrast  a  variety  of 
models  created  by  developers  who  use  different  model 
architectures  and  draw  their  models  from  different 
theoretical and practical perspectives. There are no clear
methodological guidelines for engaging in such
comparisons.  Despite  the  challenges  involved,  it  still  is 
the  case  that  the  AMBR  Model  Comparison  provided  a 
context  in  which  the  participating  architectures  were 
motivated  to  expand  and  improve  their  capabilities.  This 
suggests  that  the  general  design  of  a  comparison  of 
different HBR models to a common set of human
performance  data  is  a  fruitful  one.  If  it  continues  to  prove 
fruitful  (in  later  rounds  of  the  project),  then  hopefully 
others  will  be  motivated  to  try  this  general  methodology. 
If  that  is  to  be  the  case,  then  we  feel  a  professional 
responsibility  to  share  our  “lessons  learned”  regarding  the 
planning and implementation of an HBR model
comparison. 

These  lessons  are  drawn  from  three  sources.  First  is  the 
expert  panel,  who  met  with  the  AMBR  Model 
Comparison  organizers  and  participants  near  the  end  of 
Round  1.  Next  is  feedback  from  the  modeling  teams  who 
have  participated  in  the  project  so  far.  Last  are  our  own 
reflections  on  these  first  two  rounds  of  the  model 
comparison,  as  well  as  conversations  we’ve  had  with 
other  interested  persons  outside  the  project. 

2. Expert Panel

Near  the  completion  of  Round  1,  a  panel  of  experts  in 
human  performance  model  development  and  evaluation 
was  convened  to  provide  an  appraisal  of  the  results  of  the 
first  round.  The  panel  included  the  following 
distinguished  individuals: 

•  Dr.  Sheldon  Baron,  BBN  Technologies,  retired 

•  Dr.  Wayne  Gray,  Associate  Professor  of 
Psychology,  Human  Factors  &  Applied  Cognitive 
Program,  George  Mason  University 

•  Dr.  Harold  Hawkins,  Program  Manager,  Office  of 
Naval  Research 

• Dr. Peter Polson, Professor of Psychology,
University  of  Colorado 

The  panelists  volunteered  their  time  and  provided 
valuable  feedback  and  suggestions,  for  which  we  are 
grateful. 

The  panel  was  charged  with  providing  an  evaluation 
covering  the  following  topics:  (a)  critique  of  AMBR 
Round  1  design  and  execution,  (b)  summary  of  the 
strengths and weaknesses of each model, and (c) discussion of
issues, challenges, and recommendations for future
rounds  of  the  project. 

2.1  Critique  of  AMBR  Round  1  Design  and  Execution 

2.1.1  Small  sample,  high  variability 

Prior  to  the  expert  panel  meeting,  we  had  collected  two 
sets  of  human  performance  data  for  use  in  the 
comparison.  One  set  (N  =  8)  was  provided  to  the 
modeling  teams  for  “tuning”  their  models.  The  other  set 
(N  =  8),  collected  from  different  participants  doing  an 
analogous  set  of  scenarios,  was  reserved  for  the 
comparison  of  the  models  to  human  performance.  This 
design  was  adopted  partly  to  prevent  “over-tuning”  the 
models  to  a  particular  set  of  comparison  data,  and  partly 
to  provide  a  test  of  the  robustness  of  the  models.  The 
problem  we  ran  into,  and  that  made  it  that  much  more 
challenging  for  the  panel  to  assess  the  “goodness”  of  the 
models’  predictions,  was  that  there  was  a  fair  amount  of 
variability  in  performance  within  each  data  set,  and  one 
group  of  participants  was  better  at  the  task  than  others. 
The  small  sample  and  high  variability  in  the 
“comparison”  data  set  had  the  panel  concerned  about  the 
validity  of  concluding  that  this  sample  was  an  accurate 
representation  of  the  central  tendency  of  human 
performance in this task. This of course made it
impossible to engage in any sort of head-to-head
comparison regarding which of the models had the best
“fit”  to  the  data. 

After  the  expert  panel  meeting,  the  team  revised  the  data 
reporting  to  include  both  the  model  development  data  and 
the  comparison  data,  combined  into  a  single  set.  The 
models  were  re-run,  this  time  on  the  development  data 
scenarios,  and  those  runs  were  aggregated  with  the  other 
data  for  comparison  with  the  aggregated  human 
performance  data.  This  modification  resulted  in  a  more 
generalizable  representation  of  the  central  tendency  and 
variability  of  human  performance  in  the  ATC  task.  It  is 
these  data  that  are  reported  in  Tenney  and  Spector  [2]. 
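
To illustrate why pooling the two sets helps, the following is a minimal sketch using invented scores (these are not AMBR data, and the function name is hypothetical). It computes the standard error of the mean for each 8-participant sample and for the pooled 16-participant sample; pooling is only reasonable when, as here, the two sets of scenarios are closely analogous.

```python
import math
import statistics


def standard_error(scores):
    """Standard error of the mean: sample SD divided by sqrt(N)."""
    return statistics.stdev(scores) / math.sqrt(len(scores))


# Hypothetical performance scores, invented for illustration only.
development = [52.0, 61.0, 47.0, 70.0, 58.0, 66.0, 43.0, 63.0]
comparison = [49.0, 72.0, 55.0, 40.0, 68.0, 60.0, 45.0, 71.0]
pooled = development + comparison

if __name__ == "__main__":
    for label, scores in [("development", development),
                          ("comparison", comparison),
                          ("pooled", pooled)]:
        print(f"{label:12s} N={len(scores):2d} "
              f"mean={statistics.mean(scores):.1f} "
              f"SE={standard_error(scores):.2f}")
```

With these made-up numbers the pooled sample has a smaller standard error than either sample alone, which is the sense in which the combined data give a more stable estimate of central tendency.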

2.1.2 Inadequate assessment of robustness

One  rationale  for  the  original  2-dataset  design  (a  “tuning” 
set  and  a  “comparison”  set)  was  that  the  different 
scenarios  would  serve  as  an  assessment  of  the  robustness 
of  the  models.  During  model  development,  the  modelers 
did  not  even  have  access  to  the  scenarios  used  for 
comparison  data  collection,  so  they  weren’t  sure  what  to 
expect  and  had  to  design  their  models  in  such  a  way  that 
they  would  generalize  to  a  new  set  of  scenarios. 

It  was  made  clear,  however,  that  the  interface  symbology 
and  behavioral  requirements  would  remain  consistent 
across  scenario  sets.  In  fact,  the  only  thing  that  changed 
was  the  location  and  timing  of  the  appearance  of  planes 
on the radar screen. The panel noted that this made the
comparison scenarios so similar to the tuning scenarios
that they hardly constituted a robustness check at all.
There  was  a  silver  lining,  however,  in  that  the  similarities 
between  the  two  sets  of  scenarios  made  it  that  much  more 
justifiable  to  combine  them  for  increased  sample  size. 

2.1.3  Limitations  of  aggregate  outcome  data 

The  expert  panel  noted  that  all  of  the  human  performance 
data  to  which  we  were  comparing  the  models’  predictions 
were  aggregate  outcome  data.  No  analysis  was  completed 
at  the  level  of  individual  participants,  and  none  of  the 
analysis  focused  on  the  micro-processes  involved  in 
completing  a  scenario.  The  panel  found  that  having 
human  performance  and  model  data  only  at  the  level  of 
aggregate  outcome  measures  made  it  very  difficult  to 
distinguish  the  models  on  the  basis  of  the  fidelity  of  their 
predictions. 

There were two exceptions to this in Round 1. One is that
CHI Systems’ CGF-COGNET model made predictions
about each of the six workload measures in the TLX,
rather than just the aggregate measure. Another is that
stochastic performance characteristics of Carnegie
Mellon’s ACT-R model made it possible for them to
predict the range of variability in the human data, in
addition to the central tendency. Those two models
distinguished themselves by voluntarily going beyond the
minimum prediction requirements of the comparison.
Those minimum requirements all involved predictions of
the central tendency of aggregate outcome measures.

The decision was explicitly made to focus the analyses
and comparisons at the aggregate outcome level earlier in
the project. This was done knowing full well that others
have warned of the dangers of limiting analysis in this
manner, extolled the virtues of individual participant
analyses, and found idiographic analyses to be
informative with respect to cognitive process [3, 4, 5].
Nevertheless, in this case, pragmatic considerations (time,
funding) prevailed. The fact is, taking the data analysis to
finer levels of detail is more costly and time-consuming
and this reality cannot be ignored. We simply did not
allot sufficient time and resources in Round 1 for
idiographic data analysis, and therefore were restricted to
the more traditional nomothetic approach.
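
The limitation of aggregate-only comparison can be shown with a small sketch. All numbers below are invented, and pairing each model run with one participant is an assumption made purely for illustration. Two hypothetical models match the human mean equally well, yet only one of them tracks individual participants.

```python
import statistics

# Hypothetical per-participant scores, invented for illustration.
human = [40.0, 45.0, 50.0, 55.0, 60.0]
# Two hypothetical models, each run paired with one participant.
model_a = [41.0, 44.0, 51.0, 54.0, 60.0]   # tracks individuals closely
model_b = [60.0, 55.0, 50.0, 45.0, 40.0]   # same mean, reversed ordering


def aggregate_error(model, data):
    """Absolute difference between the model mean and the human mean."""
    return abs(statistics.mean(model) - statistics.mean(data))


def individual_error(model, data):
    """Mean absolute difference, participant by participant."""
    return statistics.mean([abs(m - d) for m, d in zip(model, data)])


if __name__ == "__main__":
    for name, model in [("model_a", model_a), ("model_b", model_b)]:
        print(name,
              "aggregate error:", aggregate_error(model, human),
              "individual error:", individual_error(model, human))
```

Both hypothetical models have zero aggregate error, so an aggregate outcome measure cannot tell them apart; the participant-level measure can.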

2.1.4 Incomplete understanding of model implementation

Another challenge the panel encountered in executing its
charge was that they were not able to come to a complete
enough understanding of each of the models to really feel
comfortable in doing an assessment of strengths and
weaknesses.

For such an assessment to be accurate, and for it to have
any constructive influence, requires a great deal of
knowledge of both the underlying modeling architecture
and of the specific implementation of that model. In
planning the agenda for the review, we allowed
approximately two hours of presentation, discussion, and
demo time per model. This turned out to be only enough
time to gain surface-level familiarity with the models -
not enough to accomplish what we initially were asking
of the expert panelists.

It was, however, enough time for the talented panel to
identify some of the accomplishments of each of these
models. These are described in the next section,
accompanied by elaborations of our own. Section 2.2
assumes some level of familiarity with the models
developed in Round 1, which can be achieved by reading
the model papers from this symposium [6, 7, 8, 9].

2.2  Summary  of  Modeler  Accomplishments 

2.2.1  ACT-R 


The ACT-R team was a late addition to the project. They
did not know about AMBR during the initial bidding and
awarding of contracts. It wasn’t until Harold Hawkins, of
the Office of Naval Research, stepped up with funding for
the ACT-R team (at about half the level of funding of the
other two funded teams) that they committed to
participating. This happened approximately four months
before the models were due, which put the ACT-R team
at a disadvantage regarding model development time.
The panel acknowledged that their accomplishments are
particularly impressive, given the reduced time and
resources.

The incorporation of an explicit representation of
sensitivity to visual onsets, which is new for an ACT-R
model, allowed for the possibility of task interruptions,
and therefore increased reactivity in the model. This is an
important milestone for the ACT-R group, because much
of the cognitive modeling community had assumed that
ACT-R’s goal-focused orientation precluded the
possibility of task interruptions, thereby limiting the
utility of ACT-R as an architecture for modeling
multi-tasking. That the addition of sensitivity to visual onsets
made this possible serves as additional evidence for the
modeling benefits to be gained by using an “embodied”
cognitive architecture.

Finally, the panel observed that the ACT-R model’s
ability to multi-task, as well as its ability to approximate
the variability in human performance (described earlier as
a characteristic that distinguished this model from the
others), were both based on previously-existing symbolic
and subsymbolic characteristics of the architecture. This
is a good thing, from the standpoint of reusability and
generalizability of architectural features. Given that one
of the goals of the model comparison is to encourage
various architectures to extend themselves in new ways, it
was not clear to the panel whether this kind of re-use
should necessarily be considered “better” than having
developed a new, special-purpose multi-tasking capability
for this model. Perhaps the right attitude is simply to
consider it noteworthy.

2.2.2  COGNET/iGEN 

Whereas the other architectures are motivated by the goal
of a unified theory of human cognition and action, the
COGNET/iGEN team are quite explicit regarding the fact
that the purpose for development of their modeling
architecture is not to put forth new theory. Rather, their
goal is to develop an expert system shell with practical
applicability in a wide variety of modeling contexts. Just
as is the case for the theory-motivated architectures, the
panel noted that the recent addition of sensory-motor
representations in CGF-COGNET is an important
extension of their architecture. It is consistent with the
general trend toward “embodiment” in cognitive
modeling [10, 11, 12].

The COGNET/iGEN model stood out from the crowd by
way of its ability to link each of the six NASA-TLX
workload sub-measures to quantitative components of the
model. The panel acknowledged this as an
accomplishment for the model, and also suggested that
the very fact that it was possible serves to increase the
construct validity of the TLX sub-measures. So this was
a win on two fronts.

Finally, the panel observed that the CGF-COGNET
variant used for this model includes the addition of a
separate knowledge type (metacognitive knowledge),
which in conjunction with declarative and procedural
knowledge, enables the model to multi-task. This
received mixed reviews from the panel. On the one hand,
it clearly makes for an effective means of managing
activity during multi-tasking, while on the other hand the
panel questioned the theoretical parsimony of a separate
metacognitive knowledge mechanism.

2.2.3  D-COG 

The panel found AFRL’s D-COG model to be an
interesting new approach to building a cognitive
architecture. The D-COG team were commended for
introducing a novel architecture that had the potential to
address some new classes of issues. A consequence of
creating an architecture with a design that really breaks
new ground is that it is even more difficult than usual to
understand that design, because many of the underlying
representational and processing assumptions are novel as
well. The panel mostly had questions about the D-COG
model, and precious few conclusions. It was never clear
to the panel exactly how multi-tasking was implemented
in D-COG, or how the architecture can be used to arrive
at response time predictions.

Although the core constructs were in place, the actual
implementation of the D-COG architecture was under
development even as the Round 1 model was being
developed. This makes it all that much more impressive
that the D-COG team managed to get a model completed,
and should be considered an accomplishment in itself.

2.2.4  EPIC-Soar 

In addition, the EPIC-Soar model developed for AMBR
Round 1 broke from the Soar tradition of a single
procedural long-term memory store and added a
representation of long-term declarative memory that
included ACT-R’s declarative knowledge decay
functions. This synthesis of elements from three different
existing architectures was considered by the panel to be
an exciting accomplishment. It also is an example of the
kind of productive cross-fertilization that can result from
bringing different modeling architectures to bear on
common human performance challenges.

The EPIC-Soar model includes a “visualization tool”
layer that provides real-time information about the current
focus of the model’s visual attention and cognitive
activity as the model is running. The panel was
enthusiastic about this as a development, debugging, and
demonstration tool.

The EPIC-Soar approach to modeling multitasking
consisted of adding a capability for task interruption, but
no explicit, separate representation for multitasking per
se.

2.3  An  Issue,  a  Challenge,  and  a  Recommendation  for 
Future  Rounds 

The panel finished off its conclusions with an
issue/challenge/recommendation triumvirate that
addresses the essence of what we found difficult in Round
1 of the AMBR Model Comparison.

The primary issue we faced throughout the project was
identifying specific points of comparison among the
models. Attempts at identifying specific points of
comparison occurred throughout Round 1. These
discussions happened at a couple of group meetings
scheduled shortly after modeling contracts were awarded,
and also by email and phone, and eventually some
consensus was reached. Prior to the expert panel review,
the Moderator provided a set of topics to the modelers
that they were to address in their presentations. These
included the following:

• Overview of model, describing unique features

• Theory/architecture on which model is based

• Description of how the model works

• Psychological findings, assumptions and intuitions
underlying your model

• Unique challenges of this task and how they were
handled

• Approach to model development

• Demo of the model on a common scenario


Going  into  the  panel  review,  this  seemed  like  a  fine  list  of 
topics  for  the  modeling  team  presentations,  and  each  was 
also  intended  as  a  point  of  comparison  among  the  models. 
To  their  credit,  the  modeling  teams  all  did  an  admirable 
job  of  addressing  each  point.  What  we  failed  to 
appreciate  beforehand  was  how  difficult  it  was  going  to 
be  to  bring  everyone  up  to  speed  on  the  first  three  topics 
(unique  features,  underlying  theory,  and  how  the  model 
works)  in  the  allotted  time.  We  mentioned  this  point 
earlier,  and  don’t  want  to  belabor  it,  but  it  is  extremely 
challenging  to  get  a  group  of  people  to  a  deep  level  of 
understanding  of  four  different  modeling  architectures 
and  the  details  of  four  different  models  built  within  those 
architectures  -  all  in  two  days. 

So  even  with  these  points  of  comparison  identified,  the 
remaining  challenge  is  to  actually  follow  through  with  the 
comparison.  This  requires  some  real  understanding  of 
those  models  and  their  underlying  architectures,  which  we 
did  not  successfully  achieve  in  such  a  short  time.  This  is 
where  the  process  broke  down.  The  entire  panel  review 
was  spent  trying  to  achieve  a  sufficient  level  of 
understanding,  and  we  never  actually  managed  the  direct 
comparison  across  models  that  had  been  intended. 

A  recommendation  from  the  panel  for  addressing  this 
challenge  in  future  rounds  of  the  project  is  to  get  panel 
members  involved  earlier.  If  they  have  more  knowledge 
of  the  modeling  focus,  task,  participating  modeling 
architectures,  and  the  developed  models  prior  to  the  panel 
review,  then  perhaps  a  sufficient  level  of  understanding 
can  be  achieved  more  quickly.  This  should  facilitate 
following  through  more  completely  with  the  direct 
comparison. 

3. Feedback from Modeling Teams

As  Round  1  drew  to  a  close,  and  we  began  to  plan  for  the 
subsequent  rounds,  a  solicitation  went  out  to  the  modeling 
teams  for  feedback  on  Round  1.  Their  replies  identified 
three  areas  of  concern  regarding  how  Round  1  was 
designed  and  implemented. 

3.1  Access  to  Simulation  Code  and  Human  Data 

A  significant  point  of  contention  for  the  modeling  teams 
was  the  fact  that  the  task  code  and  the  human 
performance  data  were  not  available  at  the  time  contracts 
were  awarded.  Not  having  access  to  the  task  code  was 
frustrating  for  the  modelers,  who  were  interested  in 
beginning  to  hook  their  architectures  into  the  simulation 
code  as  early  as  possible.  One  of  the  requirements  was 
that  the  models  actually  interface  with  the  same  task  the 
humans were using during data collection, so it is of
course perfectly reasonable that the modelers wanted
access  to  the  code.  Not  having  the  code  frozen  and  ready 
for  delivery  at  the  time  of  contract  awards  was  a 
programmatic  oversight  that  we  plan  to  avoid  in  the 
future. 

Related  to  this  was  the  complaint  that  human  performance 
data  were  provided  rather  late  in  the  contract  period. 
Continuing  software  development,  decision  making  about 
performance  measures,  and  pilot  testing  all  delayed 
completion  of  the  simulation  code,  which  of  course  also 
delayed  delivery  of  the  human  performance  data.  Clearly, 
the  modelers  managed  to  overcome  these  issues  in  the 
end,  but  the  preference  for  earlier  delivery  of  both 
simulation  code  and  human  data  came  through  clearly  in 
the  feedback. 

3.2  Lack  of  a  Common  CTA 

There  was  some  concern  (although  not  unanimous) 
regarding  the  fact  that  there  was  no  centralized  cognitive 
task  analysis  (CTA)  on  which  model  behaviors  could  be 
based.  This  meant  that,  one  way  or  another,  each 
modeling  team  had  to  complete  their  own  CTA.  The 
effect  of  this  is  that  it  creates  another  source  of  variance 
among  the  models  that  makes  it  that  much  harder  to 
compare  them  on  a  case-by-case  basis.  It  adds  a 
knowledge  level  confound  into  any  direct  comparisons 
among  the  models,  such  that  in  addition  to  architectural 
features  and  theoretical  assumptions,  it  becomes 
necessary  to  compare  differences  in  task  knowledge  and 
strategies  that  are  built  into  the  models. 

3.3  Grain  Size  of  the  Performance  Analysis 

Finally,  two  points  came  up  in  the  model  team  feedback 
regarding  the  grain  size  of  the  performance  analysis.  The 
first  was  that  it  would  have  been  useful  to  have  eye 
movement  data  from  even  a  small  sample  of  participants. 
This  can  help  constrain  the  models  and  ameliorate 
concerns  associated  with  the  lack  of  a  common  CTA. 

The  second  point  was  the  same  one  that  came  up  in  the 
discussion  of  feedback  from  the  expert  panel,  regarding 
individual  vs.  aggregate  data  analysis.  Since  this  concern 
was  discussed  in  section  2.1.3,  we  won’t  revisit  the  issue 
in  any  detail  here.  The  following  email  excerpt  from  one 
of  the  modelers  sums  the  point  up  nicely: 

“...  judging  the  quality  of  such  a  model  merely  by 
comparison  to  summary  outcome  measures  ignores 
various  levels  of  verisimilitude  and  associated  model 
utility  which  could  otherwise  be  examined.” 


4. Implications for Future Comparisons

To this point, we have shared a variety of challenges from
and pieces of feedback regarding the first round of the
comparison. What do we conclude from these points, and
what are the implications of those conclusions for future
comparisons (in the AMBR project or other HBR model
comparisons)?

4.1  Programmatic  Concerns 

One major grouping of the points made earlier in the
paper can fall under the heading of “programmatic
concerns.” These are issues related to the organization,
scheduling, and management of the comparison.

4.1.1 Timely access to simulation code and human data

Although we had an intellectual understanding of the
importance of timely access to simulation code and
human data prior to Round 1, the project agenda in Round
1 did not reflect this understanding. As a result, there was
frustration and floundering among the modelers for a
while, as the simulation code was being completed, then
as the human performance data were collected.

The implication for future comparisons is that delivery of
frozen simulation code should occur immediately after the
awarding of modeler contracts and delivery of human
performance data should follow shortly thereafter. The
entire program agenda should be designed with this in
mind.

4.1.2 Common CTA

Although one can always make a valid argument for
providing a common CTA for a model comparison, we
tend to think that the importance of this increases with the
complexity of the task and the amount of training required
to achieve adequate competence. Thus, to the extent that
the simulation environment is more toward the “lab task”
end of the continuum, a common CTA is less important.

The jury is still out on this, and deliberations are likely to
continue, but we tend to think that future AMBR
comparisons also will not involve a common CTA.

4.1.3  Involvement  of  the  expert  panelists 

We think it is likely that the model comparison would
benefit from increased involvement of the panelists, and
we intend to get them involved earlier, and on a more
continuous basis, throughout future model comparisons.
This is likely to take a variety of forms, including (a)
participation on the model team selection committee, (b)
consultations regarding empirical design and points of
model comparison early in each round, (c) a 2-stage
review process involving the expert panel in which an
initial assessment of the models is made, then the
modelers have an opportunity to make improvements to
the models before convening for the final comparison,
and (d) an extra day added to the end of the final
comparison, dedicated to following through on the points
of comparison identified earlier in the round.

4.2  Empirical  Concerns 

A  second  major  grouping  of  the  points  made  earlier  in  the 
paper  can  fall  under  the  heading  of  “empirical  concerns.” 
These  are  issues  related  to  the  experimental  design,  data 
analysis,  and  model  assessment. 

4.2.1  Data  sources  and  analyses 

We  are  increasingly  convinced  that  a  combination  of 
individual  and  aggregate  data,  collected  from  as  wide  a 
variety  of  sources  (computer  log  files,  eye  movements, 
verbal  protocols)  as  is  possible  given  the  time  and  funding 
constraints  of  the  project,  provides  the  best  hope  for  truly 
stressing  the  predictive  boundaries  of  human  modeling 
architectures.  Stressing  these  boundaries  is  likely  to 
illuminate  shortcomings  and  increase  the  probability  of 
further  advancements  in  the  state  of  the  art.  Future 
AMBR  model  comparisons  are  likely  to  include  a  wider 
variety  of  data,  analyzed  at  both  individual  participant  and 
aggregate  levels. 

4.2.2  Model  robustness 

The  scenarios  represented  in  the  evaluation  data  were  too 
similar  to  those  for  the  development  data  to  allow  tests  of 
the robustness of the models. However, improving on
this  weakness  is  not  straightforward.  On  the  one  hand 
each  model  was  designed  to  represent  specific  task 
requirements.  To  present  scenarios  that  stretch  the  model 
demands  beyond  the  specified  task  requirements  seems 
“unfair”  and,  to  some  extent,  uninformative. 

It  does  seem  “fair”  to  specify  to  the  developers  the 
“envelope”  within  which  task  parameters  might  vary,  and 
to  use  as  large  an  envelope  as  possible,  given  the  project's 
time  and  funding  constraints,  but  not  to  specify  where  in 
that  envelope  the  specific  scenarios  will  be  selected  for 
evaluation.  Then  the  development  data  and  the  evaluation 
data  could  be  drawn  from  different  regions  of  that 
envelope. It also might be appropriate and insightful to
formulate one “challenge scenario” that takes the
modeling requirements judiciously outside the
“envelope,” but to go beyond that does not seem
productive. Exactly how one identifies the boundaries of
this envelope, and what it means to be “judiciously
outside” the envelope remain unspecified.

4.2.3 Increasing understanding of the models

As discussed earlier, actually following through on some
form of head-to-head model comparisons requires first
that those making the assessment have a thorough
understanding of the implementation of the models. We
have already mentioned, in the section on programmatic
concerns, that we will strive for increased expert panel
participation in future rounds. What else can we do to
bring about a deeper understanding of the models for
purposes of the comparison? Here are several ideas
currently under discussion for future rounds:

(a) A code walk-through was suggested (by the Round 1
panel, in fact) as possibly a useful component of future
reviews. Clearly, the utility of this depends on the
technical background of the panel members, so the
decision whether to do this is primarily up to them.

(b) Characterize each model in terms of the number of
fixed and variable parameters. Alternatively, partition the
parameters into those that characterize the task and the
working environment and those that reflect the human
representation, and then describe the subset of the human
parameters that were “adapted” to the specifics of the task
and development data rather than fixed a priori.
However, models differ in the extent to which they
independently characterize the task environment and the
human performance, so that such a partitioning would not
always be possible or even sensible.

(c) Require the model developers to undertake sensitivity
analyses to relate the fit of their models to the
performance data as a function of the setting of one or
more critical parameters of the model.

(d) Encourage each modeler to include an interface that
makes the internal processing of the model transparent to
an observer. The EPIC-Soar team provided a dynamic
pictorial representation of the eye scan patterns being
undertaken by the model together with status lights that
indicated the class of the activity that was being
undertaken at each moment in time. These features were
very helpful in understanding the sequencing of
interruptions and activities that were being undertaken.


(e) Ask the modelers to collaborate with the moderator
team to create a comprehensive table of features across
models. This structure should be developed early in the
project so that it can be used by all the developers to
catalog their models at the time of the comparison and
evaluation.
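
The sensitivity analysis suggested in the list above can be sketched as a simple parameter sweep: rerun the model at several settings of one parameter and record the fit to the human data at each setting. The sketch is hypothetical throughout: `toy_model` is a stand-in for a real HBR model, `human_rt` is made-up data rather than AMBR results, and root-mean-square error is just one possible fit measure.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between model predictions and observed data."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(observed))


def toy_model(workload_levels, seconds_per_item):
    """Hypothetical stand-in for a model: response time grows with workload."""
    return [1.0 + seconds_per_item * level for level in workload_levels]


def sensitivity_sweep(model, conditions, observed, settings):
    """Return (setting, fit) pairs for each value of the swept parameter."""
    return [(s, rmse(model(conditions, s), observed)) for s in settings]


workload_levels = [1, 2, 3, 4]
human_rt = [1.5, 2.0, 2.5, 3.0]      # made-up data, not AMBR results
settings = [0.25, 0.5, 0.75, 1.0]    # values of the swept parameter

sweep = sensitivity_sweep(toy_model, workload_levels, human_rt, settings)
best_setting, best_fit = min(sweep, key=lambda pair: pair[1])

for setting, fit in sweep:
    print(f"seconds_per_item={setting:.2f}  RMSE={fit:.3f}")
```

A flat curve would suggest the fit does not depend much on that parameter, while a sharp minimum would suggest the fit relies on careful tuning of it.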

We hope this list and the earlier description of Round 1
challenges and feedback will serve as fodder for
discussion among others interested in hosting an HBR
model comparison. The paper now turns to a brief
description of the future of the AMBR Model
Comparison.

5. Current Plans for Future Comparisons

Pew and Mavor [13] note that, despite the fact that there
is an enormous literature on memory and learning in the
experimental psychology, cognitive psychology, and
cognitive science fields, and despite the fact that this
research is relevant to the representation of human
behavior in military simulations, “... current military
simulations make little or no use of learning models” (p.
148). It is the need for advancements in the state of the
art for modeling learning processes that has persuaded us
to focus this model comparison on models of concept
learning in the context of the ATC task.

5.1  Round  3:  Concept  Learning 

The task for Round 3 will retain the multi-tasking
perceptual-motor features of the air traffic control (ATC)
task used in AMBR Rounds 1 and 2. This task is
described in detail in Tenney and Spector [2]. It is still
undecided whether the task will retain the HLA-based
federation architecture developed for AMBR Round 2
(the Icarus Federation) or will return to the non-HLA
format.

The significant change in the task from Rounds 1 and 2 to
Round 3 will be that, in place of the speed query, it will
contain an embedded concept learning task. Multiple aircraft will
query the controller (the controller that is being modeled)
about the possibility of changing altitude. The controller
will decide whether to authorize an altitude change based
on a multi-dimensional attribute matrix that might include
dimensions like aircraft size, level of atmospheric
turbulence, and current altitude. The controller must learn
the appropriate responses on the basis of feedback
received through the user interface concerning whether
the decision was correct. The feature structure
of this concept learning task is based on the laboratory
study by Shepard, Hovland and Jenkins [14], and
modeling studies reported by Nosofsky et al. [15].
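
For readers unfamiliar with the Shepard, Hovland and Jenkins design: it uses eight stimuli built from three binary dimensions, divided into two categories of four, with category structures that differ in how many dimensions are needed to classify correctly. The sketch below shows three of those structures (Types I, II and VI) together with correct/incorrect feedback. The mapping of dimensions to aircraft size, turbulence and altitude, the 0/1 coding, and the authorize/deny reading of the categories are illustrative assumptions, not the Round 3 design.

```python
from itertools import product

# Hypothetical mapping of the three binary dimensions to task attributes.
DIMENSIONS = ("size", "turbulence", "altitude")

# All eight stimuli: every combination of three binary attribute values.
STIMULI = list(product((0, 1), repeat=3))


def type_i(stimulus):
    """Type I: a single dimension determines the category."""
    return stimulus[0] == 1


def type_ii(stimulus):
    """Type II: exclusive-or of two dimensions; the third is irrelevant."""
    return stimulus[0] != stimulus[1]


def type_vi(stimulus):
    """Type VI: odd parity, so all three dimensions are needed."""
    return sum(stimulus) % 2 == 1


def feedback(rule, stimulus, response):
    """True when the authorize/deny response agrees with the category rule."""
    return rule(stimulus) == response


for name, rule in (("I", type_i), ("II", type_ii), ("VI", type_vi)):
    authorized = [s for s in STIMULI if rule(s)]
    print(f"Type {name}: authorize {authorized}")
```

A learning model would see only the stimulus and the value returned by `feedback`, and would have to induce the rule from repeated trials.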


5.2  Round  4:  More  on  Concept  Learning 

The task for Round 4 will be fundamentally similar to the
task used in Round 3, but the details are still under
consideration. Based on the results of the Round 3 model
evaluations, the Round 4 task will be designed to further
stress the models and examine their capabilities
(modeling team contracts will extend through both
Rounds 3 and 4). We anticipate a focus on the ability of
models to adapt from one set of learned concepts to a
new, changed set of concepts based on the same or a
similar set of concept attributes. Other manipulations
such as the workload of the perceptual-motor task may
also be explored as deemed appropriate given the results
of Round 3.
Overview


*  HBR  Accomplishments 

*  Feedback  from  Expert  Panel 

*  Feedback  from  Modelers 

*  Implications  for  Future  Comparisons 

*  AMBR  Rounds  3  and  4 
HBR Accomplishments


•  ACT-R 

—  Unit  task  framework  for  multi-tasking 
—  Visual  onsets/color  changes 
—  Workload  computation 

•  COGNET/iGEN 

—  First  application  of  CGF  (i.e.,  embodied)  architecture 
—  Detailed  workload  computation 

•  D-COG 

—  Opportunity  to  build  and  test  the  architecture 

•  EPIC-Soar 

—  Most  dynamic/complex  application  to  date 
—  Declarative  base-level  activation  and  decay 
—  Workload  computation 

•  All: Transferred model to HLA federation
Feedback from Expert Panel


*  Small sample, large individual differences

*  Limited utility of aggregate outcome data

*  Weak assessment of robustness

*  Incomplete understanding of model implementations


Feedback  from  Modelers 


*  Want  earlier  access  to  simulation  code  and  human  data 


*  Possibly  beneficial  to  have  a  common  CTA 


*  Grain-size of the behavior analysis


Implications  for  Future  Comparisons 


*  Programmatic  Changes 

—  Earlier  access  to  simulation  and  data  (definitely) 

—  Increased  involvement  of  the  expert  panelists  (definitely) 

—  Generate  common  CTA  (under  consideration) 


•  Empirical  Changes 

—  Aggregate  and  individual  analyses  (definitely) 

—  Increase  understanding  of  the  models  (definitely) 

—  High-density  data  (probably) 

—  Stronger  test  of  model  robustness  (under  consideration) 


AMBR  Rounds  3  and  4 


Modeling  Focus 

Category  Learning 

Context 

Modified  version  of  ATC  task 


Controller  must  learn  to  respond 
appropriately  to  requests  for  altitude  change 


Moderator

BBN (Pew, Deutsch, Diller, Tenney, Spector, Benyo)


Modelers 

{To  be  selected} 
Questions?

Suggestions? 

Comments? 


(all are welcome)